fix: prevent WS-broadcast OOM crash under connection churn by skialpine · Pull Request #1 · skialpine/openscale

skialpine · 2026-05-25T22:27:39Z

Stacked on feat/scale-telemetry (PR decentespresso#57). Base will be retargeted to main once decentespresso#57 merges. The telemetry in decentespresso#57 (reset_reason, heap) is what made this diagnosable.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc; Arduino-ESP32 builds -fno-exceptions, so the throw goes to std::terminate() → abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure — not thermal (die temp was 33 °C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage   (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll                 (include/websocket.h, loop() 10 Hz broadcast)

The existing 15 KB heap watchdog can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation bad_allocs long before it acts.

Fix

wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
-D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.
CLAUDE.md: documented the footgun (notes + troubleshooting table).

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build — conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected:

Free heap driven to 6436 bytes (old build crashed at ~4684).
Gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
No abort, no reboot, weight stream uninterrupted (uptime continuous).

🤖 Generated with Claude Code

Diagnostics for the field "weight stops being collected under sustained load" reports (suspected thermal). Adds to the WS status frame and serial logs: - soc_temp_c / soc_temp_max_c: live + peak ESP32-S3 die temperature (temperatureRead()), sampled every 2s on the main loop. - weight_stalled + stall_count + last_stall_ms + last_stall_temp_c: a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed for >8s (a live cell dithers every sample), recording the die temp at the moment of the stall to correlate failures with heat. (This failure is not firmware- recoverable, so it's surfaced, not silently retried.) - reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT reset is attributable instead of looking like a clean power-on. Telemetry-only; no behavior change to the weight/WiFi/BLE paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…nitor) Drives USB 10Hz + WS 10Hz + HTTP/WS churn + mDNS (BT driven externally) and polls the new temp/stall telemetry every ~60s, watching for the weight-stall failure and the die temp at which it occurs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Review follow-ups on the telemetry watchdog: - Skip the stall check while b_adc_recovery_active (the ADS1232 power-cycle freezes the raw value by design); re-seed the window on resume so a genuine signal-timeout recovery isn't miscounted as a railed/frozen stall. - Check every 250 ms instead of every loop iteration -- the ADC only produces ~10 samples/s, so polling getDebugInfo() at full loop rate (with its sqrt + dataset passes) just burns CPU/heat, which is counterproductive on the chip we're trying to characterize. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

From the toolkit review of PR decentespresso#57: - temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON and a frozen peak since NaN compares false); keep last valid + log once. - g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the AsyncTCP status path); status frame casts it for printf. - Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop keeps re-seeding the stall window so weight_stalled may never trip -- the climbing recovery count makes that failure mode visible. i_adc_recovery_count is now volatile (newly read cross-task). - reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons (CPU_LOCKUP/USB/JTAG) stay attributable. - Comment fixes: volatile cross-task rationale; stall-window re-seed wording + recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- DURATION/IP/HOST now positional args; warn (don't silently skip) if no USB port. - Telemetry monitor logs reset_reason + adc_recovery_count per line, waits a full status interval after (re)connect, tracks peak temp / stalls / recoveries / reboots across the whole run (so a firmware reset doesn't lose the peak), and prints a SUMMARY line with a PASS/FAIL verdict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Addresses the iteration-2 review findings on PR decentespresso#57: - Status frame no longer reads the multi-field stopWatch object directly off the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The loop task now snapshots it into aligned volatiles (g_timerRunning/ g_timerElapsed) that both status frames read. - Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps counting truthfully over a long soak instead of saturating; update the WS format specifier %u -> %lu accordingly. - SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON. - Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed sentinel) at boot/rollover. - README: document the new status-frame telemetry fields. - thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status frames or a crashed load generator, and exit non-zero on FAIL so it works as a CI gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- thermal_load_test.sh: close the silent-PASS holes a review found. A flapping wedge (scale answers one frame per reconnect, resetting the consecutive-miss streak) now fails via a cumulative total_no_status counter, not just max_no_status_streak. Each load generator's PID is captured and waited on individually so a never-started/crashed driver (non-zero exit) fails the run instead of being missed by a Traceback-only grep. A run that never saw soc_temp_max_c (peak stuck at the -999 sentinel) also fails, since the thermal data the test exists to capture is absent. - CLAUDE.md: add "Fixing bugs you find along the way" — pre-existing bugs get fixed in the same change, not deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Root-caused from a captured panic backtrace: under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses and AsyncWebSocket's printfAll path allocates an AsyncWebSocketMessage per client -> operator new throws std::bad_alloc -> (Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot. That OOM-reboot is the "weight stops being collected under load" failure (not thermal -- die temp was 33 C). Decoded stack: operator new -> __cxa_throw -> std::terminate -> abort AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490) AsyncWebSocket::printfAll sendWebsocketWeightAll (websocket.h) <- loop() 10 Hz broadcast Fix: - Heap-gate every broadcast-to-all helper (weight, status, button, power-off) with wsBroadcastHeapOk(): skip the frame when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog) instead of allocating into an exhausted heap and crashing. Dropping a frame is invisible; the next is <=500 ms away. - Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32) so a backed-up/half-open client can't hoard heap. - Document the footgun in CLAUDE.md (notes + troubleshooting table). Stacked on the scale-telemetry branch (PR decentespresso#57) whose reset_reason / heap telemetry made this diagnosable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three findings from the code review on cb569d3: 1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame is invisible (the next weight frame is <=500 ms away, status <=5 s); crashing is not." The "status <=5 s" recovery guarantee no longer holds after this PR removes the only periodic status broadcaster. Rewrote to describe what's actually still gated by wsBroadcastHeapOk (weight + button + power_off). 2. include/websocket.h:243-247 had an addendum paragraph framing the surviving printfAll comment as "Note: status is no longer broadcast periodically...". The historical framing reads as documentation for a function that no longer exists; dropping it leaves the printfAll technical commentary applying cleanly to sendWebsocketWeightAll (which is what it now sits above). 3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch directly from the AsyncTCP task. CLAUDE.md threading model #1 footgun: "stopWatch.* -- No -- multi-field (running flag + start ts + accumulator) and also mutated from loop(), BLE and USB; a status- frame read can tear a write across tasks." The bug is pre-existing on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you find along the way" applies. Mirroring the snapshot pattern used by PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h, refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial same-lines conflict that git auto-resolves.) Verified on hardware: flash + 12s window with events on + rate 10k -> 121 weight frames at 10 Hz, 0 periodic status frames; on-request {"command":"status"} returns the full 16-field frame with timer_running/timer_seconds populated from the snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

skialpine and others added 8 commits May 25, 2026 12:24

skialpine force-pushed the feat/scale-telemetry branch from f3adb84 to 62a09f9 Compare May 27, 2026 14:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent WS-broadcast OOM crash under connection churn#1

fix: prevent WS-broadcast OOM crash under connection churn#1
skialpine wants to merge 8 commits into
feat/scale-telemetryfrom
fix/ws-oom-broadcast

skialpine commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

skialpine commented May 25, 2026

Root cause (from a captured + decoded panic backtrace)

Fix

Verification (on hardware)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant